The typical standard deviation between pairs of models on this dataset, as a function of absolute accuracy.
Here is a more informative figure showing the source information used to compute the p-value. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition because no model pairs have a small #A_win + #B_win, which rules out significant results at a small difference |#A_win - #B_win|. For more explanation, see the doc.
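To make the role of #A_win and #B_win concrete, here is a minimal sketch of one way a p-value can be computed from those two counts. It assumes a two-sided sign test on the examples where exactly one model succeeds; the page does not state which test it uses, and the function names are hypothetical. Under the normal approximation, the significance boundary is |#A_win - #B_win| = z * sqrt(#A_win + #B_win), which traces a parabola when the sum is plotted against the difference.

```python
import math


def sign_test_p_value(a_wins: int, b_wins: int) -> float:
    """Two-sided exact sign test (assumed test, not confirmed by the page).

    a_wins: number of examples solved by A but not by B.
    b_wins: number of examples solved by B but not by A.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence
    k = min(a_wins, b_wins)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def min_significant_gap(n_disagree: int, z: float = 1.96) -> float:
    """Normal approximation: smallest |a_wins - b_wins| that is significant
    at the level implied by z, given n_disagree = a_wins + b_wins."""
    return z * math.sqrt(n_disagree)
```

For example, `sign_test_p_value(5, 0)` evaluates to 0.0625, and with 100 disagreements the gap must exceed `min_significant_gap(100)`, roughly 19.6, to be significant at the 5% level under this approximation.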
We show 3 methods currently used for evaluating code models: raw accuracy (used by benchmarks), average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients, following Chatbot Arena). Average win-rate always correlates well with Elo. GPT-3.5 is assigned an Elo of 1000 when available; otherwise the average Elo is set to 1000. std: the standard deviation due to drawing examples from a population; this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example. std_total: the total standard deviation, satisfying std_total^2 = std^2 + std_i^2.
| model | pass1 | win_rate | count | SE(A) | SE_x(A) | SE_pred(A) |
|---|---|---|---|---|---|---|
| 20250928_trae_doubao_seed_code | 78.8 | 31.6 | 1 | 1.8 | 1.8 | 0 |
| 20250804_epam-ai-run-claude-4-sonnet | 76.8 | 29.7 | 1 | 1.9 | 1.9 | 0 |
| 20250902_atlassian-rovo-dev | 76.8 | 29.5 | 1 | 1.9 | 1.9 | 0 |
| 20250819_ACoder | 76.4 | 29.2 | 1 | 1.9 | 1.9 | 0 |
| 20250901_warp | 75.6 | 29.1 | 1 | 1.9 | 1.9 | 0 |
| 20250612_trae | 75.2 | 28.2 | 1 | 1.9 | 1.9 | 0 |
| 20250731_harness_ai | 74.8 | 27.3 | 1 | 1.9 | 1.9 | 0 |
| 20250720_Lingxi-v1.5_claude-4-sonnet-20250514 | 74.6 | 27.3 | 1 | 1.9 | 1.9 | 0 |
| 20250915_JoyCode | 74.6 | 28.3 | 1 | 1.9 | 1.9 | 0 |
| 20250603_Refact_Agent_claude-4-sonnet | 74.4 | 27.4 | 1 | 2 | 2 | 0 |
| 20250522_tools_claude-4-opus | 73.2 | 27.6 | 1 | 2 | 2 | 0 |
| 20250522_tools_claude-4-sonnet | 72.4 | 26.5 | 1 | 2 | 2 | 0 |
| 20250807_openhands_gpt5 | 71.8 | 26.1 | 1 | 2 | 2 | 0 |
| 20250715_qodo_command | 71.2 | 25.5 | 1 | 2 | 2 | 0 |
| 20250710_bloop | 71.2 | 25.3 | 1 | 2 | 2 | 0 |
| 20250929_Prometheus_v1.2_gpt5 | 71.2 | 26.2 | 1 | 2 | 2 | 0 |
| 20250623_warp | 71 | 25.3 | 1 | 2 | 2 | 0 |
| 20250611_moatless_claude-4-sonnet-20250514 | 70.8 | 24.6 | 1 | 2 | 2 | 0 |
| 20250519_trae | 70.6 | 24.6 | 1 | 2 | 2 | 0 |
| 20250610_augment_agent_v1 | 70.4 | 25.2 | 1 | 2 | 2 | 0 |
| 20250515_Refact_Agent | 70.4 | 24.4 | 1 | 2 | 2 | 0 |
| 20250524_openhands_claude_4_sonnet | 70.4 | 25.1 | 1 | 2 | 2 | 0 |
| 20250519_devlo | 70.2 | 24.3 | 1 | 2 | 2 | 0 |
| 20250430_zencoder_ai | 70 | 24.6 | 1 | 2 | 2 | 0 |
| 20250805_openhands-Qwen3-Coder-480B-A35B-Instruct | 69.6 | 24.6 | 1 | 2.1 | 2.1 | 0 |
| 20250930_zai_glm4-6 | 68.2 | 23.5 | 1 | 2.1 | 2.1 | 0 |
| 20250516_cortexa_o3 | 68.2 | 23.4 | 1 | 2.1 | 2.1 | 0 |
| 20250522_sweagent_claude-4-sonnet-20250514 | 66.6 | 22.6 | 1 | 2.1 | 2.1 | 0 |
| 20250514_aime_coder | 66.4 | 22.1 | 1 | 2.1 | 2.1 | 0 |
| 20250415_openhands | 65.8 | 21.7 | 1 | 2.1 | 2.1 | 0 |
| 20250716_openhands_kimi_k2 | 65.4 | 21.4 | 1 | 2.1 | 2.1 | 0 |
| 20250405_amazon-q-developer-agent-20250405-dev | 65.4 | 21.2 | 1 | 2.1 | 2.1 | 0 |
| 20250316_augment_agent_v0 | 65.4 | 21.1 | 1 | 2.1 | 2.1 | 0 |
| 20250503_patchpilot-v1.1-o4-mini | 64.6 | 21.1 | 1 | 2.1 | 2.1 | 0 |
| 20250117_wandb_programmer_o1_crosscheck5 | 64.6 | 20.8 | 1 | 2.1 | 2.1 | 0 |
| 20250728_zai_glm4-5 | 64.2 | 21 | 1 | 2.1 | 2.1 | 0 |
| 20250206_agentscope | 63.4 | 19.5 | 1 | 2.2 | 2.2 | 0 |
| 20250224_tools_claude-3-7-sonnet | 63.2 | 20.1 | 1 | 2.2 | 2.2 | 0 |
| 20250228_epam-ai-run-claude-3-5-sonnet | 62.8 | 19.8 | 1 | 2.2 | 2.2 | 0 |
| 20250110_blackboxai_agent_v1.1 | 62.8 | 20.6 | 1 | 2.2 | 2.2 | 0 |
| 20250225_sweagent_claude-3-7-sonnet | 62.4 | 19.3 | 1 | 2.2 | 2.2 | 0 |
| 20241221_codestory_midwit_claude-3-5-sonnet_swe-search | 62.2 | 19.3 | 1 | 2.2 | 2.2 | 0 |
| 20250203_openhands_4x_scaled | 60.8 | 18.4 | 1 | 2.2 | 2.2 | 0 |
| 20250901_entroPO_R2E_QwenCoder30BA3B_tts | 60.4 | 19.1 | 1 | 2.2 | 2.2 | 0 |
| 20250110_learn_by_interact_claude3.5 | 60.2 | 20.9 | 1 | 2.2 | 2.2 | 0 |
| 20250629_deepswerl_r2eagent_tts | 58.8 | 18 | 1 | 2.2 | 2.2 | 0 |
| 20250410_cortexa | 58.2 | 17.1 | 1 | 2.2 | 2.2 | 0 |
| 20241213_devlo | 58.2 | 17 | 1 | 2.2 | 2.2 | 0 |
| 20241223_emergent | 57.2 | 16.1 | 1 | 2.2 | 2.2 | 0 |
| 20241208_gru | 57 | 16.4 | 1 | 2.2 | 2.2 | 0 |
| 20250924_artemis_agent_v2 | 57 | 17.4 | 1 | 2.2 | 2.2 | 0 |
| 20250405_swe-rizzo_claude37 | 56.6 | 16.6 | 1 | 2.2 | 2.2 | 0 |
| 20241212_epam-ai-run-claude-3-5-sonnet | 55.4 | 15.1 | 1 | 2.2 | 2.2 | 0 |
| 20241202_amazon-q-developer-agent-20241202-dev | 55 | 15.3 | 1 | 2.2 | 2.2 | 0 |
| 20241108_devlo | 54.2 | 14.9 | 1 | 2.2 | 2.2 | 0 |
| 20250804_codesweep_sweagent_kimi_k2_instruct | 53.4 | 14.9 | 1 | 2.2 | 2.2 | 0 |
| 20250120_Bracket | 53.2 | 15.9 | 1 | 2.2 | 2.2 | 0 |
| 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 | 53 | 14.7 | 1 | 2.2 | 2.2 | 0 |
| 20250901_entroPO_R2E_QwenCoder30BA3B | 52.2 | 14.5 | 1 | 2.2 | 2.2 | 0 |
| 20241212_google_jules_gemini_2.0_flash_experimental | 52.2 | 14.6 | 1 | 2.2 | 2.2 | 0 |
| 20241125_enginelabs | 51.8 | 14.7 | 1 | 2.2 | 2.2 | 0 |
| 20250805_openhands-Qwen3-Coder-30B-A3B-Instruct | 51.6 | 14 | 1 | 2.2 | 2.2 | 0 |
| 20250122_autocoderover-v2.1-claude-3-5-sonnet-20241022 | 51.6 | 13.9 | 1 | 2.2 | 2.2 | 0 |
| 20241202_agentless-1.5_claude-3.5-sonnet-20241022 | 50.8 | 13.9 | 1 | 2.2 | 2.2 | 0 |
| 20241125_marscode-agent-dev | 50 | 13.4 | 1 | 2.2 | 2.2 | 0 |
| 20241028_solver | 50 | 13 | 1 | 2.2 | 2.2 | 0 |
| 20241105_nfactorial | 49.2 | 12.8 | 1 | 2.2 | 2.2 | 0 |
| 20241022_tools_claude-3-5-sonnet-updated | 49 | 12.8 | 1 | 2.2 | 2.2 | 0 |
| 20241025_composio_swekit | 48.6 | 12.3 | 1 | 2.2 | 2.2 | 0 |
| 20241106_navie-2-gpt4o-sonnet | 47.2 | 12.8 | 1 | 2.2 | 2.2 | 0 |
| 20250616_Skywork-SWE-32B+TTS_Bo8 | 47 | 12.1 | 1 | 2.2 | 2.2 | 0 |
| 20250520_openhands_devstral_small | 46.8 | 12 | 1 | 2.2 | 2.2 | 0 |
| 20241023_emergent | 46.6 | 11.8 | 1 | 2.2 | 2.2 | 0 |
| 20241108_autocoderover-v2.0-claude-3-5-sonnet-20241022 | 46.2 | 11.5 | 1 | 2.2 | 2.2 | 0 |
| 20250528_patchpilot_Co-PatcheR | 46 | 11.5 | 1 | 2.2 | 2.2 | 0 |
| 20240924_solver | 45.4 | 11 | 1 | 2.2 | 2.2 | 0 |
| 20240824_gru | 45.2 | 11.2 | 1 | 2.2 | 2.2 | 0 |
| 20250118_codeshellagent_gemini_2.0_flash_experimental | 44.2 | 11.1 | 1 | 2.2 | 2.2 | 0 |
| 20240920_solver | 43.6 | 10.5 | 1 | 2.2 | 2.2 | 0 |
| 20250214_agentless_lite_o3_mini | 42.4 | 11.2 | 1 | 2.2 | 2.2 | 0 |
| 20250527_amazon.nova-premier-v1.0 | 42.4 | 11.2 | 1 | 2.2 | 2.2 | 0 |
| 20250629_deepswerl_r2eagent | 42.2 | 11.2 | 1 | 2.2 | 2.2 | 0 |
| 20250806_SWE-Exp_DeepSeek-V3 | 42 | 9.79 | 1 | 2.2 | 2.2 | 0 |
| 20250112_ugaiforge | 41.6 | 9.55 | 1 | 2.2 | 2.2 | 0 |
| 20241030_nfactorial | 41.6 | 10.3 | 1 | 2.2 | 2.2 | 0 |
| 20250226_swerl_llama3_70b | 41.2 | 10.2 | 1 | 2.2 | 2.2 | 0 |
| 20241113_nebius-search-open-weight-models-11-24 | 40.6 | 9.27 | 1 | 2.2 | 2.2 | 0 |
| 20241016_composio_swekit | 40.6 | 9.21 | 1 | 2.2 | 2.2 | 0 |
| 20241022_tools_claude-3-5-haiku | 40.6 | 9.47 | 1 | 2.2 | 2.2 | 0 |
| 20240820_honeycomb | 40.6 | 9.98 | 1 | 2.2 | 2.2 | 0 |
| 20250511_sweagent_lm_32b | 40.2 | 9.07 | 1 | 2.2 | 2.2 | 0 |
| 20241029_epam-ai-run-claude-3-5-sonnet | 39.6 | 9.3 | 1 | 2.2 | 2.2 | 0 |
| 20241028_agentless-1.5_gpt4o | 38.8 | 9.04 | 1 | 2.2 | 2.2 | 0 |
| 20240721_amazon-q-developer-agent-20240719-dev | 38.8 | 9.4 | 1 | 2.2 | 2.2 | 0 |
| 20240628_autocoderover-v20240620 | 38.4 | 9.31 | 1 | 2.2 | 2.2 | 0 |
| 20250725_sweagent_devstral_small_2507 | 38 | 8.55 | 1 | 2.2 | 2.2 | 0 |
| 20250616_Skywork-SWE-32B | 38 | 8.86 | 1 | 2.2 | 2.2 | 0 |
| 20240617_factory_code_droid | 37 | 8.98 | 1 | 2.2 | 2.2 | 0 |
| 20240620_sweagent_claude3.5sonnet | 33.6 | 7.54 | 1 | 2.1 | 2.1 | 0 |
| 20250306_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor | 32.8 | 7.22 | 1 | 2.1 | 2.1 | 0 |
| 20240612_MASAI_gpt4o | 32.6 | 7.24 | 1 | 2.1 | 2.1 | 0 |
| 20241120_artemis_agent | 32 | 7.05 | 1 | 2.1 | 2.1 | 0 |
| 20241007_nfactorial | 31.6 | 6.47 | 1 | 2.1 | 2.1 | 0 |
| 20241128_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor_20241128 | 30.2 | 6.44 | 1 | 2.1 | 2.1 | 0 |
| 20241002_lingma-agent_lingma-swe-gpt-72b | 28.8 | 6.1 | 1 | 2 | 2 | 0 |
| 20241016_epam-ai-run-gpt-4o | 27 | 5.72 | 1 | 2 | 2 | 0 |
| 20240615_appmap-navie_gpt4o | 26.2 | 5.35 | 1 | 2 | 2 | 0 |
| 20241001_nfactorial | 25.8 | 5.28 | 1 | 2 | 2 | 0 |
| 20240509_amazon-q-developer-agent-20240430-dev | 25.6 | 5.53 | 1 | 2 | 2 | 0 |
| 20240918_lingma-agent_lingma-swe-gpt-72b | 25 | 4.46 | 1 | 1.9 | 1.9 | 0 |
| 20240820_epam-ai-run-gpt-4o | 24 | 4.37 | 1 | 1.9 | 1.9 | 0 |
| 20240728_sweagent_gpt4o | 23.2 | 4.35 | 1 | 1.9 | 1.9 | 0 |
| 20250627_agentless_MCTS-Refine-7B | 23.2 | 6.31 | 1 | 1.9 | 1.9 | 0 |
| 20240402_sweagent_gpt4 | 22.4 | 4.13 | 1 | 1.9 | 1.9 | 0 |
| 20241002_lingma-agent_lingma-swe-gpt-7b | 18.2 | 2.98 | 1 | 1.7 | 1.7 | 0 |
| 20240402_sweagent_claude3opus | 15.8 | 2.44 | 1 | 1.6 | 1.6 | 0 |
| 20240918_lingma-agent_lingma-swe-gpt-7b | 10.2 | 1.37 | 1 | 1.4 | 1.4 | 0 |
| 20240402_rag_claude3opus | 7 | 0.934 | 1 | 1.1 | 1.1 | 0 |
| 20231010_rag_claude2 | 4.4 | 0.62 | 1 | 0.92 | 0.92 | 0 |
| 20240402_rag_gpt4 | 2.8 | 0.362 | 1 | 0.74 | 0.74 | 0 |
| 20231010_rag_swellama7b | 1.4 | 0.411 | 1 | 0.53 | 0.53 | 0 |
| 20231010_rag_swellama13b | 1.2 | 0.266 | 1 | 0.49 | 0.49 | 0 |
| 20231010_rag_gpt35 | 0.4 | 0.0623 | 1 | 0.28 | 0.28 | 0 |
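
The standard-error columns above can be reproduced with a short calculation. This is a sketch under stated assumptions, not the page's own code: it assumes SE_x(A) is the binomial standard error of pass1 over the examples, that the dataset has 500 examples (a value inferred from the table entries, not stated in the text), and that SE(A) combines SE_x(A) and SE_pred(A) in quadrature in the same way as std_total^2 = std^2 + std_i^2. The function names are hypothetical.

```python
import math


def binomial_se_percent(pass1_percent: float, n_examples: int) -> float:
    """Standard error of an accuracy, in percentage points, from sampling
    n_examples examples (assumed formula: sqrt(p * (1 - p) / n))."""
    p = pass1_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)


def total_se(se_x: float, se_pred: float) -> float:
    """Combine the two components in quadrature: sqrt(se_x^2 + se_pred^2)."""
    return math.hypot(se_x, se_pred)


# n_examples=500 is an assumption inferred from the table, not from the text.
se_x = binomial_se_percent(78.8, n_examples=500)  # about 1.8
se_total = total_se(se_x, 0.0)                    # equals se_x when SE_pred is 0
```

With count = 1 there is a single prediction per example, which is consistent with SE_pred(A) being reported as 0 and SE(A) equal to SE_x(A) in every row.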